# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>1 |</b> Business Objective</p></div>

In this notebook, I analyse data from GetSafe, an insurtech company. The data set covers 6,000 policyholders with 11 attributes describing their region, sales channel, platform, and insured coverage, including age and product type. The variable we would like to predict is customer conversion (the `converted` column), as tracked by GetSafe.

The first aim of this analysis is to find insights that help GetSafe grow with its customers. To that end, I performed exploratory data analysis after cleaning the data and engineering features. The second aim is to identify the customer segments that lead to a higher conversion rate. The GetSafe marketing team had already built proxy metrics for acquiring young customers. To test a modification to an app feature on the mobile app and the website, I used an A/B test to measure whether the change improves the conversion rate among the customers most likely to convert.
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>2 |</b> Data Overview </p></div>

import numpy as np
import pandas as pd
import missingno
import plotly
from plotly import offline
from plotly import graph_objs as go
from plotly import express as px
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
import datetime
from scipy.stats import chi2_contingency, beta
from IPython.display import Image
import statsmodels.stats.api as sms
from math import ceil
import scipy.stats as stats

df=pd.read_csv("converted_data.csv")
df.sample(5)

print("There are {:,} observations and {} columns in the data set.".format(df.shape[0], df.shape[1]))

df.columns

df.dtypes

df.describe()

Three columns ("area_classification", "most_used_os" and "buying_platform") have missing entries. Furthermore, the maximum age is 2013, which is clearly wrong.
missingno.bar(df,color="#AFE8BB",fontsize=30)

C = (df.dtypes == 'object')
CategoricalVariables = list(C[C].index)
Integer = (df.dtypes == 'int64')
Float = (df.dtypes == 'float64')
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

# np.prod is used because np.product was removed in NumPy 2.0
Missing_Percentage = (df.isnull().sum()).sum()/np.prod(df.shape)*100
print("The percentage of missing entries before cleaning: " + str(round(Missing_Percentage,2)) + " %")

plt.boxplot(df["age"])
plt.show()

The output shows one outlier in age that is higher than any plausible age for a policyholder. The NaN entries also need to be handled, so the next section is dedicated to cleaning the data.
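The outlier that the boxplot reveals can also be flagged numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule that boxplot whiskers are usually based on; the helper name `iqr_outliers` and the sample ages are hypothetical and are not taken from the data set.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages, including one implausible entry similar to the 2013 value
sample_ages = [18, 25, 31, 40, 52, 67, 91, 2013]
print(iqr_outliers(sample_ages))
```

The quartile method here may differ slightly from the one matplotlib uses, so fence values can deviate a little on real data.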
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>3 |</b> Data Cleaning </p></div>

* Deleting the erroneous maximum age of 2013.

index = df[(df['age'] >= 100)].index
df.drop(index, inplace=True)
df['age'].describe()

* Removing all the rows that contain a missing value

#checking the null values
df.isnull().sum()

#removing the null values
df.dropna(axis="rows", how="any", inplace = True)
df.reset_index(drop=True, inplace=True)

missingno.bar(df,color="#AFE8BB",fontsize=30)

#checking the null values after cleaning
df.isnull().sum()

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>4.1 |</b> Summary Statistics of Numeric Columns </p></div>

df.describe().T

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>4.2 |</b> Summary Statistics of Categorical Columns </p></div>

df.select_dtypes(include=['object']).describe()

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>5 |</b> Exploratory Data Analysis</p></div>

# Preparing for visualisation with a pie chart.
# It represents data visually as fractional parts of a whole.
def create_pie(name,title,text):
    """ Create a pie chart using plotly with 3 arguments: name, title, text"""
    labels=name.index
    values=name.values
    trace0=go.Pie(values=values,labels=labels,hole=0.4,textposition="inside",textinfo="label+percent+value")
    data=[trace0]
    layout=dict(title=title,title_x=0.5,annotations=[dict(text=text,x=0.5,y=0.5,showarrow=False,font_size=14)])
    fig=dict(data=data,layout=layout)
    offline.iplot(fig)

# Preparing for visualisation with a sunburst chart.
# A sunburst is ideal for displaying hierarchical data. It is made up of an inner circle
# surrounded by rings of deeper hierarchy levels. The angle of each segment is either
# proportional to a value or divided equally under its parent node.
def create_sunburst(df,path:list,values,title):
    """ Create a sunburst chart using plotly"""
    fig=px.sunburst(df,path=path,values=values)
    fig.update_layout(title=title,title_x=0.5)
    fig.show()

Since the erroneous maximum age was removed in the data cleaning section, the minimum age is now 18 and the maximum is 91, so the observed age range is 18 to 91.
plt.boxplot(df["age"])
plt.show()

sns.displot(df['age'],color="#AFE8BB")

This shows that there are no observations outside the 18-91 range, so removing the erroneous value was an appropriate way to cap the ages. A column for the age bucket is added to sharpen the focus on this significant variable.
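The age bucket can be derived from the decade of each age. The following is a minimal sketch, assuming ages are whole numbers of years; the helper name `age_bucket` is hypothetical, and the label format follows the "in 20's" convention used in this notebook.

```python
def age_bucket(age):
    """Map an age in years to a decade label such as "in 20's"."""
    decade = (int(age) // 10) * 10
    return f"in {decade}'s"


# Ages at the edges of two buckets
print(age_bucket(18))
print(age_bucket(29))
print(age_bucket(30))
```

With pandas, `df["age"].map(age_bucket)` should give the same labels as a chain of range filters for ages between 10 and 99. Unlike such a chain, this helper also labels ages outside that range, which may or may not be desirable.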
#Feature Engineering
df.loc[(df["age"]>=10)&(df["age"]<=19),"in_age"]="in 10's"
df.loc[(df["age"]>=20)&(df["age"]<=29),"in_age"]="in 20's"
df.loc[(df["age"]>=30)&(df["age"]<=39),"in_age"]="in 30's"
df.loc[(df["age"]>=40)&(df["age"]<=49),"in_age"]="in 40's"
df.loc[(df["age"]>=50)&(df["age"]<=59),"in_age"]="in 50's"
df.loc[(df["age"]>=60)&(df["age"]<=69),"in_age"]="in 60's"
df.loc[(df["age"]>=70)&(df["age"]<=79),"in_age"]="in 70's"
df.loc[(df["age"]>=80)&(df["age"]<=89),"in_age"]="in 80's"
df.loc[(df["age"]>=90)&(df["age"]<=99),"in_age"]="in 90's"

df.sample(5)

df.info()

df.columns

df.describe()

#By product
df["first_product"].unique()

product=df["first_product"].value_counts(sort=True)
product

create_pie(product,"Product Distribution","Product")

The liability insurance product records the highest share of sales at 53.8%, followed by the content insurance product at 19.9%.
df.groupby("first_product").agg(Minimum_age=("age","min"), Maximum_Age=("age","max"), Average_Age=("age","mean")).reset_index()

# By Age
df["age"].min()

df["age"].max()

age=df.groupby("in_age").agg(Frequency=("in_age","count")).reset_index()
age

create_sunburst(age,["in_age","Frequency"],"Frequency", "Age Distribution")

GetSafe's potential customers are mainly in their 20s and 30s.
product_vs_inage=df.groupby(["first_product","in_age"]).agg(Frequency=("first_product","count")).reset_index()
product_vs_inage

create_sunburst(product_vs_inage,["first_product","in_age","Frequency"], "Frequency","Product Age Frequency")

The personal liability policy has the highest number of insured persons, especially among young customers in their 20s. According to statistics published by Evgenia Koptyug (2022) from a 2021 survey on living situations in Germany, broken down by age group, 77.5% of 20-to-29-year-olds are renters. GetSafe has clearly succeeded in targeting this age group with its liability product.

However, according to the same survey, 46.2% of 14-to-19-year-olds in Germany are renters, which is quite high. GetSafe therefore has a strong opportunity to target this age group and increase sales of its personal liability product, for example through a marketing campaign and new features that meet the needs of customers aged 14 to 19. This would in turn support GetSafe's growth.
# By Region
df["area_classification"].unique()

region=df["area_classification"].value_counts(sort=True)
region

create_pie(region,"Region Classification","Region")

As shown, rural areas record the highest number of GetSafe policies, followed by the Big 7 areas.
# rename_axis + reset_index(name=...) yields the same two columns on pandas 1.x and 2.x
df_product_Area=df['area_classification'].value_counts().rename_axis('area_classification').reset_index(name='count')

fig = go.Figure(data=[go.Scatter(
    x=df_product_Area['area_classification'], y=df_product_Area['count'],
    mode='markers',
    marker=dict(
        color=df_product_Area['count'],
        size=df_product_Area['count']*0.070, # Scaling by 0.070 to reduce size and stay uniform across all points
        showscale=True
    ))])
fig.update_layout(title='Area classification',xaxis_title="Regions",yaxis_title="Frequency Of Area ",title_x=0.5)
fig.show()

Product_vs_region=df.groupby(["area_classification","first_product"]).agg(Product_Count=("first_product","count")).reset_index()
Product_vs_region

create_sunburst(Product_vs_region,["area_classification","first_product","Product_Count"], "Product_Count","Product Region and Count.")

inage_vs_region=df.groupby(["first_product","in_age","area_classification"]).agg(frequency=("area_classification","count")).reset_index()
inage_vs_region

create_sunburst(inage_vs_region,["first_product","in_age","area_classification","frequency"] ,"frequency","Product age and region.")

fig = px.scatter(inage_vs_region, x='frequency', y='in_age', color='area_classification', size='frequency',
                 title="The frequency of the issued policies increases with rural and big 7 areas among 20's and 30's",
                 color_discrete_sequence=['#AFE8BB','#003C2B'],height=600)
fig.update_layout(legend=dict(title='',orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
                  font_color="#303030",
                  xaxis=dict(title='Policies frequencies',showgrid=False),
                  yaxis=dict(title='Age Group',showgrid=False, zerolinecolor='#E5E5EA',
                             showline=True, linecolor='#E5E5EA', linewidth=2))
fig.show()

For each region, the sunburst shows the distribution of the insurance products. Liability and content insurance policies reached the highest sales in rural and Big 7 areas.

The main difference between personal liability and content insurance is that a content insurance policy covers your own belongings. If it doesn't include tenants' liability, it won't cover any damage to your landlord's belongings.

If customers damage their own belongings, it won't affect the return of the deposit on their rental. But damage to the landlord's belongings could cause problems when the customer wants the deposit back at the end of the tenancy period. This is where tenants' liability insurance could help.

As shown above, the relationship between personal liability and content policies is quite strong. This can create growth opportunities through cross-selling and through marketing campaigns that highlight the coverage of each policy and explain why a tenant needs both policies to be protected. Moreover, the GetSafe website and mobile platform already suggest the two personal liability plans, Comfort and Premium, when the customer tries to buy a liability policy. As a result, a visitor to the website is exposed to the two plans simultaneously.

In addition to region and product type, younger age groups are positively associated with certain product types. Across the product types, the frequency of issued policies tends to increase in certain regions.

In an interview in April 2021 about the company's expansion plan in the UK, Christian Wiens (CEO) described the company's aim, "the insurance age", as replicating the success that it had in Germany. The analysis above suggests that this KPI is being achieved.

However, the analysis shows that the highest sales come from only two insurance policies: personal liability and content. GetSafe therefore needs to develop its other insurance products, such as dental, travel, dog, and accident insurance.

Taking dental insurance as an example of growth beyond the liability and content products, we need to look at other insurtech companies that offer the same type of product at different prices and with different features. The table below compares GetSafe with its competitors Ottonova and Feather. First, we need to understand the competitors' potential by concentrating on the price and features of their plans. Ottonova offers better prices and multiple plan tiers, such as Economy, Business, or First Class, so GetSafe needs to reconsider either its price or the features of its plan.

If GetSafe decides to reconsider its price, the underwriting team should discuss this with the actuary and with GetSafe's reinsurer, Munich Re. If, on the other hand, GetSafe decides to develop the features of its plan, the company should replicate the product development strategy it applied to its personal liability policy, which offers multiple plans ("Comfort plan" and "Premium plan"). This variation has not been applied to the GetSafe dental policy, although it has proved successful for the personal liability policy, as shown in the analysis above. It likely contributes to Ottonova growing faster than GetSafe. In conclusion, the product development team should create new plan variations for the dental product.
# By Sales Channel
df["buying_platform"].unique()

platform=df["buying_platform"].value_counts(sort=True)
platform

create_pie(platform,"Platform Distribution","Platform")

It is interesting to see that the web platform is performing slightly better than the mobile application.
age_vs_platform=df.groupby(["in_age","buying_platform"]).agg(platform_count=("buying_platform","count")).reset_index()
age_vs_platform

create_sunburst(age_vs_platform,["in_age","buying_platform","platform_count"], "platform_count","Platform Vs. Age")

More precisely, customers in their 20s use the website and the mobile application about equally, although the mobile application was expected to be used more than the website. One possible reason is that the GetSafe website was designed to meet the expectations of customers in their 20s.
df.sample(5)

Channel_vs_product=df.groupby(["channel","first_product"]).agg(chanel_frequency=("channel","count")).reset_index()
Channel_vs_product

create_sunburst(Channel_vs_product,["channel","first_product","chanel_frequency"],"chanel_frequency","Channel Vs Product")

The CEO highlighted the significance of the sales channel for GetSafe when he emphasised the impact of the Covid-19 crisis on new business. He said "the customers reconsidering whether or not to make personal appointments with a broker. They are switching to digital insurers and discovering the benefits. Many of them will not switch back after the crisis is over. That is the greatest danger for insurers – and the greatest opportunity for insurgents".

Based on the analysis, aggregators are GetSafe's biggest sales channel: half of GetSafe's sales come through this digital channel. For GetSafe, aggregators have become, in effect, the customer-facing side of the business. This change has led to a strong network effect: rising use leads more product providers to employ aggregators as a sales channel, while increasing market coverage attracts more users. Increased coverage also leads to better conversion rates, effectively driving down acquisition costs for the aggregator.

### Recommended actions for GetSafe's Growth Team

The medium- to long-term growth story of the B2C e-commerce industry in Germany promises to be attractive. B2C e-commerce is expected to grow steadily over the forecast period, recording a CAGR of 13.62% during 2022-2026. The country's B2C e-commerce Gross Merchandise Value is forecast to increase from 46,933 million USD to 93,843.4 million USD by 2026. According to McKinsey's report, 60% of retail banking insurance products in Germany depend on aggregators.

As a result, I recommend that the growth team and the data team label the types of aggregators that GetSafe deals with, such as price-comparison, lead-generation only, broker, and product provider. This will not only improve the results of the data analysis but also give the growth team a map for writing a new strategy for selling the other products via aggregators, because the majority of sales via aggregators are of the liability product.
df.sample(4)

df.groupby(["buying_platform","most_used_os"]).agg(Minimum_age=("age","min") ,Maximum_age=("age","max"), Average_age=("age","mean")).reset_index()

df.sort_values(by="age",ascending=False).groupby("first_product").first().reset_index()

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:20px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>6 |</b> A/B testing for GetSafe marketing Campaign</p></div>

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:15px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b> Table of contents :</b><br>Part I | Data Overview<br>Part II | Feature Engineering<br>Part III | Data Overview<br>Part IV | Designing our experiment<br>Part V | Choosing variables<br>Part VI | Sampling<br>Part VII | Visualising Results<br>Part VIII | Testing the hypothesis<br>Part IX | Drawing conclusions</p></div>
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part I |</b> Data Overview</p></div>

df1=pd.read_csv("converted_data.csv")

df1.sample(4)

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part II |</b> Feature Engineering</p></div>

# Adding a user ID for each entry, since we assume that all customers are unique
df1['user_id'] = np.arange(len(df1))
df1.sample(4)

df1['user_id'].describe()

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part III |</b> Data Overview</p></div>

start_time = datetime.datetime.strptime(df1['joined_getsafe_at'].min(),'%Y-%m-%d %H:%M:%S')
end_time = datetime.datetime.strptime(df1['joined_getsafe_at'].max(),'%Y-%m-%d %H:%M:%S')
data_duration = (end_time - start_time).days
print(f"Number of unique users in experiment: {df1['user_id'].nunique()}")
print(f"Data collected for {data_duration} days")
print(f"Variants to compare: {df1['variant'].unique().tolist()}")
print(f"Percentage of users in Control: {round(df1[df1['variant']== 'A'].shape[0] * 100 / df1.shape[0])}%")

Since all users are unique, there is no need to get the timestamp of the first exposure; therefore, the following steps are not applied:
1- Get timestamp of first exposure
2- Remove users with multiple buckets
sns.boxplot(x="variant", y="converted", data=df1,color="#003C2B");
plt.show()

counter = df1['user_id'].value_counts()
(counter > 1).value_counts()

No user_id (0%) has been exposed to both control and treatment, so all users are retained for the experiment.
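When user IDs come from a real exposure log rather than being generated per row, the check for users seen in both groups can be made explicit. The following is a minimal sketch under that assumption; the function name `users_in_multiple_variants` and the sample log are hypothetical.

```python
from collections import defaultdict


def users_in_multiple_variants(exposures):
    """Return the sorted ids of users seen in more than one variant."""
    variants_by_user = defaultdict(set)
    for user_id, variant in exposures:
        variants_by_user[user_id].add(variant)
    return sorted(u for u, seen in variants_by_user.items() if len(seen) > 1)


# Hypothetical exposure log: user 3 appears in both groups
exposures = [(1, "A"), (2, "B"), (3, "A"), (3, "B"), (4, "A")]
print(users_in_multiple_variants(exposures))
```

Users returned by such a check would normally be removed before sampling, so that each user contributes to exactly one group.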
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part IV |</b> Designing our experiment</p></div>

GetSafe's marketing team built a proxy metric for a new app feature intended to acquire more young customers, and the product development team worked on a new version of the app feature in the hope that it will lead to a higher conversion rate. Assume that the product manager (PM) told us that the current conversion rate is about 7% on average throughout the year, and that the team would be happy with an increase of 3 percentage points, meaning that the new design will be considered a success if it raises the conversion rate to 10%.
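The 7% baseline and 10% target translate into a required sample size per group. The following is a minimal sketch of the arithmetic, assuming Cohen's h as the effect size and the standard closed-form normal approximation for a two-sided test with equal group sizes. The function names are hypothetical, and a numerical power solver may return a slightly different figure.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def required_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate observations needed per group (two-sided, equal groups)."""
    h = cohens_h(p1, p2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / h) ** 2)


print(round(cohens_h(0.07, 0.10), 4))
print(required_sample_size(0.07, 0.10))
```

The factor of 2 reflects that both groups contribute sampling noise to the difference between the two proportions.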
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part V |</b> Choosing variables</p></div>

For our test we'll need two groups under the `variant` column:
* A control group - They'll be shown the old design
* B treatment (or experimental) group - They'll be shown the new design
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part VI |</b> Sampling</p></div>

effect_size = sms.proportion_effectsize(0.07, 0.10)    # Calculating effect size based on our expected rates
required_n = sms.NormalIndPower().solve_power(effect_size, power=0.8, alpha=0.05,ratio=1)    # Calculating sample size needed
required_n = ceil(required_n)    # Rounding up to next whole number
print(required_n)

control_sample = df1[df1['variant'] == 'A'].sample(n=required_n, random_state=22)
treatment_sample = df1[df1['variant'] == 'B'].sample(n=required_n, random_state=22)
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)
ab_test

ab_test['variant'].value_counts()

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part VII |</b> Visualising Results</p></div>

conversion_rates = ab_test.groupby('variant')['converted']
std_p = lambda x: np.std(x, ddof=0)              # Std. deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)            # Std. error of the proportion (std / sqrt(n))
conversion_rates = conversion_rates.agg(['mean', std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']
conversion_rates.style.format('{:.3f}')

Judging by the stats above, our two designs performed fairly similarly, with the new design performing somewhat better: approx. 50.6% vs. 45.6% conversion rate.
The conversion rates for our groups are indeed fairly close. Also note that the conversion rate of the control group is lower than what we would have expected given what we knew about our avg. conversion rate (46% observed vs. 51% expected). This goes to show that there is some variation in results when sampling from a population.
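The sampling variation mentioned above can be illustrated with a small simulation. This is only a sketch: the population rate `true_rate`, the sample size `n`, and the helper `sample_conversion_rate` are illustrative assumptions, not values or functions taken from the GetSafe data.

```python
import random
from math import sqrt

random.seed(22)

true_rate = 0.51  # assumed population conversion rate (illustrative)
n = 1000          # hypothetical sample size, not the notebook's required_n


def sample_conversion_rate(rate, size):
    """Simulate `size` users who each convert with probability `rate`."""
    conversions = sum(random.random() < rate for _ in range(size))
    return conversions / size


# Five independent samples from the same population give five different rates
sample_rates = [sample_conversion_rate(true_rate, n) for _ in range(5)]

# Standard error of a proportion: sqrt(p * (1 - p) / n)
theoretical_se = sqrt(true_rate * (1 - true_rate) / n)

print([f"{r:.3f}" for r in sample_rates])
print(f"theoretical std. error: {theoretical_se:.4f}")
```

Each simulated sample lands near, but rarely exactly on, the assumed population rate, and the typical size of the gap is what the `std_error` column estimates.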
# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part VIII |</b> Testing the hypothesis</p></div>

The last step of our analysis is testing our hypothesis. Since the sample is large, we use the normal approximation (a z-test) to calculate our p-value.
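To make the normal approximation concrete, here is a minimal hand-rolled version of a two-sided, pooled two-proportion z-test. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and chosen only for illustration; they are not the notebook's actual results.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference of two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one conversion rate
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 456/1000 conversions in control, 506/1000 in treatment
z, p = two_proportion_ztest(456, 1000, 506, 1000)
print(f"z statistic: {z:.2f}")
print(f"p-value: {p:.3f}")
```

A negative z statistic here simply means the first group (control) converted at a lower rate than the second (treatment).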
```python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = ab_test[ab_test['variant'] == 'A']['converted']
treatment_results = ab_test[ab_test['variant'] == 'B']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

# Two-sided z-test for the difference between the two conversion rates
z_stat, pval = proportions_ztest(successes, nobs=nobs)
# 95% confidence interval for each group's conversion rate
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')
```

# <div style="color:white;display:fill;border-radius:5px;background-color:#AFE8BB;overflow:hidden"><p style="padding:10px;color:#003C2B;overflow:hidden;font-size:100%;margin:0"><b>Part IX |</b> Drawing conclusions</p></div>

Since our p-value of 0.90% is below our α = 0.05 (5%) threshold, we can reject the null hypothesis H₀, which means that our new design performed significantly differently from our old app feature. Because the treatment group's observed conversion rate is the higher of the two, the difference is in the new design's favour.

Additionally, if we look at the 95% confidence interval for the treatment group ([0.480, 0.533], i.e. 48.0% to 53.3%) we notice that:

* It lies entirely above our baseline value of a 7% conversion rate
* It is also above our target value of 10% (the 3 percentage point uplift we were aiming for)

What this means is that the true conversion rate of the new app feature is likely to be higher than both our baseline and the 10% target we had hoped for. This is further evidence that our new design targeting "young customers" is likely to be an improvement on the old design of the marketing campaign.

<br> In conclusion, customers in their twenties and thirties are more likely to convert.
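The comparison of the confidence interval with the baseline and target rates can be sketched by hand using the normal (Wald) interval. The helper `wald_interval` and the counts below are hypothetical; only the 7% baseline and 10% target come from the analysis above.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


baseline, target = 0.07, 0.10  # rates used when planning the test

# Hypothetical treatment group: 506 conversions out of 1000 users
lower, upper = wald_interval(506, 1000)

print(f"ci 95%: [{lower:.3f}, {upper:.3f}]")
print("baseline inside interval:", lower <= baseline <= upper)
print("target inside interval:", lower <= target <= upper)
```

When both checks print `False` and the lower bound exceeds the target, the interval sits wholly above the planned rates, which is the situation described in the conclusion.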